Weighted parallel SGD for distributed unbalanced-workload training system

Authors

Daning Cheng

Shigang Li

Yunquan Zhang

Abstract

Stochastic gradient descent (SGD) is a popular stochastic optimization method in machine learning. Traditional parallel SGD algorithms, e.g., SimuParallel SGD [1], often require all nodes to have the same performance or to consume equal quantities of data. These requirements are difficult to satisfy when a parallel SGD algorithm runs in a heterogeneous computing environment, where low-performance nodes exert a negative influence on the final result. In this paper, we propose an algorithm called weighted parallel SGD (WP-SGD). WP-SGD combines weighted model parameters from different nodes in the system to produce the final output. It uses the reduction in standard deviation to compensate for the loss caused by inconsistent node performance in the cluster, so WP-SGD does not require all nodes to consume equal quantities of data. We also analyze the theoretical feasibility of running two other parallel SGD algorithms combined with WP-SGD in a heterogeneous environment. The experimental results show that WP-SGD significantly outperforms traditional parallel SGD algorithms on distributed training systems with an unbalanced workload.
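
The idea of combining weighted model parameters from different nodes can be illustrated with a minimal sketch. The abstract does not state how WP-SGD chooses its weights, so the weights below (proportional to the number of samples each node consumed) are purely an assumption for illustration, and the names `weighted_combine`, `node_models`, and `samples_seen` are hypothetical.

```python
from typing import List, Sequence


def weighted_combine(models: Sequence[Sequence[float]],
                     weights: Sequence[float]) -> List[float]:
    """Return the weight-normalized average of per-node parameter vectors."""
    if not models or len(models) != len(weights):
        raise ValueError("need one weight per model and at least one model")
    dim = len(models[0])
    if any(len(m) != dim for m in models):
        raise ValueError("all parameter vectors must have the same length")
    total = float(sum(weights))
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    # Weighted sum over nodes for each coordinate, then normalize.
    return [
        sum(w * m[j] for m, w in zip(models, weights)) / total
        for j in range(dim)
    ]


# Hypothetical cluster: three nodes that processed unequal amounts of data.
node_models = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
# ASSUMPTION: weight each node by samples consumed; this is not the
# weighting rule from the paper, which the abstract does not specify.
samples_seen = [100, 300, 600]

combined = weighted_combine(node_models, samples_seen)
print(combined)
```

With equal weights this reduces to the plain parameter averaging used by equal-workload schemes; unequal weights let nodes that did more work contribute more to the final model.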

Similar resources

Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour

Deep learning thrives with large neural networks and large datasets. However, larger networks and larger datasets result in longer training times that impede research and development progress. Distributed synchronous SGD offers a potential solution to this problem by dividing SGD minibatches over a pool of parallel workers. Yet to make this scheme efficient, the per-worker workload must be larg...

Staleness-Aware Async-SGD for Distributed Deep Learning

This paper investigates the effect of stale (delayed) gradient updates within the context of asynchronous stochastic gradient descent (Async-SGD) optimization for distributed training of deep neural networks. We demonstrate that our implementation of Async-SGD on an HPC cluster can achieve a tight bound on the gradient staleness while providing near-linear speedup. We propose a variant of the SG...

CuMF_SGD: Fast and Scalable Matrix Factorization

Matrix factorization (MF) has been widely used in, e.g., recommender systems, topic modeling, and word embedding. Stochastic gradient descent (SGD) is popular for solving MF problems because it can handle large data sets and lends itself to incremental learning. We observed that SGD for MF is memory bound. Meanwhile, single-node CPU systems with caching perform well only for small data sets; dis...

1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs

We show empirically that in SGD training of deep neural networks, one can, at no or nearly no loss of accuracy, quantize the gradients aggressively—to but one bit per value—if the quantization error is carried forward across minibatches (error feedback). This size reduction makes it feasible to parallelize SGD through data-parallelism with fast processors like recent GPUs. We implement data-par...

A Load Balancing Strategy for Iterated

An efficient template for the implementation on distributed-memory multiprocessors of iterated parallel loops, i.e., parallel loops nested in a sequential loop, is presented. The template is explicitly designed to smooth unbalanced processor workloads deriving from loops whose iterations are characterized by highly varying execution times. Experiments conducted show performance gains w.r.t. HPF-l...

Journal:

CoRR

Volume: abs/1708.04801; Issue:

Pages: -

Publication date: 2017
